%reload_ext autoreload
%autoreload 2
%matplotlib inline
import pandas as pd
import scipy.stats as stats
import plotly.express as px
import numpy as np
from sklearn.model_selection import RepeatedKFold
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
import itertools
In the following section, the two main datasets are imported and some transformations are performed on the data.
The dataset containing the data collected at the end of the experiment is imported.
df_time_results = pd.read_csv('time_results_entries.csv', sep=';')
df_time_results.tail()
| | UserID | Method | Class | Time | Correctness |
|---|---|---|---|---|---|
| 45 | 3 | String unescape(String string) | XML.java | 24 | 1 |
| 46 | 3 | Decoder createDecoder(String cacheKey, Type type) | extra\GsonCompatibilityMode.java | 32 | 1 |
| 47 | 3 | void enocde_(Object obj, JsonStream stream) | output\ReflectionObjectEncoder.java | 40 | 1 |
| 48 | 3 | void enableDecoders() | extra\Base64FloatSupport.java | 33 | 1 |
| 49 | 3 | boolean equals(Object o) | extra\GsonCompatibilityMode.java | 17 | 1 |
df_time_results.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   UserID       50 non-null     int64
 1   Method       50 non-null     object
 2   Class        50 non-null     object
 3   Time         50 non-null     int64
 4   Correctness  50 non-null     int64
dtypes: int64(3), object(2)
memory usage: 2.1+ KB
Some columns are renamed to keep the naming convention consistent with the other dataset.
df_time_results.rename(columns={'Method': 'Name', 'Class': 'Path'},
inplace=True)
df_time_results.tail()
| | UserID | Name | Path | Time | Correctness |
|---|---|---|---|---|---|
| 45 | 3 | String unescape(String string) | XML.java | 24 | 1 |
| 46 | 3 | Decoder createDecoder(String cacheKey, Type type) | extra\GsonCompatibilityMode.java | 32 | 1 |
| 47 | 3 | void enocde_(Object obj, JsonStream stream) | output\ReflectionObjectEncoder.java | 40 | 1 |
| 48 | 3 | void enableDecoders() | extra\Base64FloatSupport.java | 33 | 1 |
| 49 | 3 | boolean equals(Object o) | extra\GsonCompatibilityMode.java | 17 | 1 |
The dataset containing the methods chosen during the planning phase of the experiment is imported. It also contains all the software metrics related to the chosen methods.
df_chosen_methods_statistics = pd.read_csv('experiment_all_chosen_methods.csv')
df_chosen_methods_statistics.tail()
| | Module Position | Module Complexity | McCC | TCLOC | LOC | TLLOC | NL | Name | Path |
|---|---|---|---|---|---|---|---|---|---|
| 27 | 394 | 21 | 10 | 10 | 70 | 59 | 6 | String toString(JSONArray ja) | JSONML.java |
| 28 | 1522 | 36 | 17 | 13 | 40 | 33 | 7 | void populateMap(Object bean) | JSONObject.java |
| 29 | 89 | 12 | 8 | 28 | 44 | 33 | 2 | JSONObject toJSONObject(String string) | Cookie.java |
| 30 | 133 | 6 | 7 | 6 | 12 | 12 | 1 | int dehexchar(char c) | JSONTokener.java |
| 31 | 1231 | 11 | 13 | 12 | 36 | 30 | 2 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java |
df_chosen_methods_statistics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Module Position    32 non-null     int64
 1   Module Complexity  32 non-null     int64
 2   McCC               32 non-null     int64
 3   TCLOC              32 non-null     int64
 4   LOC                32 non-null     int64
 5   TLLOC              32 non-null     int64
 6   NL                 32 non-null     int64
 7   Name               32 non-null     object
 8   Path               32 non-null     object
dtypes: int64(7), object(2)
memory usage: 2.4+ KB
The two imported datasets are merged on the name of the method. This operation corresponds to a LEFT JOIN in SQL.
df_merged_statistics_results = \
    df_time_results.merge(df_chosen_methods_statistics,
                          how='left', on='Name')
df_merged_statistics_results.tail()
| | UserID | Name | Path_x | Time | Correctness | Module Position | Module Complexity | McCC | TCLOC | LOC | TLLOC | NL | Path_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 3 | String unescape(String string) | XML.java | 24 | 1 | 199 | 8 | 4 | 11 | 23 | 19 | 3 | XML.java |
| 46 | 3 | Decoder createDecoder(String cacheKey, Type type) | extra\GsonCompatibilityMode.java | 32 | 1 | 334 | 46 | 8 | 0 | 26 | 112 | 7 | extra\GsonCompatibilityMode.java |
| 47 | 3 | void enocde_(Object obj, JsonStream stream) | output\ReflectionObjectEncoder.java | 40 | 1 | 65 | 21 | 10 | 0 | 42 | 42 | 4 | output\ReflectionObjectEncoder.java |
| 48 | 3 | void enableDecoders() | extra\Base64FloatSupport.java | 33 | 1 | 98 | 12 | 1 | 0 | 10 | 50 | 0 | extra\Base64FloatSupport.java |
| 49 | 3 | boolean equals(Object o) | extra\GsonCompatibilityMode.java | 17 | 1 | 171 | 5 | 15 | 0 | 21 | 19 | 1 | extra\GsonCompatibilityMode.java |
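One way to verify that a left join found a match for every row is pandas' `indicator` flag; a minimal sketch on toy frames (the frames and column values here are illustrative, not the experiment's data):

```python
import pandas as pd

# indicator=True adds a '_merge' column telling, for each row,
# whether the join key was found in both frames or only in the left one
left = pd.DataFrame({'Name': ['a', 'b'], 'Time': [10, 20]})
right = pd.DataFrame({'Name': ['a'], 'LOC': [5]})
merged = left.merge(right, how='left', on='Name', indicator=True)
print(merged['_merge'].value_counts())
```

Rows flagged as `left_only` would indicate method names with no matching metrics.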
The UserID column is converted into a string so that it can be treated as a categorical variable.
df_merged_statistics_results['UserID'] = \
df_merged_statistics_results['UserID'].astype(str)
df_merged_statistics_results.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 49
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   UserID             50 non-null     object
 1   Name               50 non-null     object
 2   Path_x             50 non-null     object
 3   Time               50 non-null     int64
 4   Correctness        50 non-null     int64
 5   Module Position    50 non-null     int64
 6   Module Complexity  50 non-null     int64
 7   McCC               50 non-null     int64
 8   TCLOC              50 non-null     int64
 9   LOC                50 non-null     int64
 10  TLLOC              50 non-null     int64
 11  NL                 50 non-null     int64
 12  Path_y             50 non-null     object
dtypes: int64(9), object(4)
memory usage: 5.5+ KB
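A minimal sketch of the same idea on a toy column; pandas' dedicated `category` dtype is an alternative that states the intent explicitly (the DataFrame here is illustrative):

```python
import pandas as pd

# Hypothetical miniature of the UserID column
df = pd.DataFrame({'UserID': [1, 1, 2, 3]})
# astype(str) removes the numeric meaning; chaining astype('category')
# additionally marks the column as categorical for pandas
df['UserID'] = df['UserID'].astype(str).astype('category')
print(df['UserID'].dtype)
```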
df_merged_statistics_results.rename(
columns={'Module Complexity': 'Cognitive Complexity',
'Path_x': 'Path'},
inplace=True)
df_merged_statistics_results.head()
| | UserID | Name | Path | Time | Correctness | Module Position | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL | Path_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 1 | 2267 | 9 | 9 | 17 | 30 | 22 | 2 | JSONObject.java |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 1 | 2565 | 23 | 12 | 27 | 54 | 53 | 5 | JSONObject.java |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 1 | 1231 | 11 | 13 | 12 | 36 | 30 | 2 | JSONObject.java |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 1 | 200 | 18 | 25 | 12 | 56 | 56 | 3 | XMLTokener.java |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 1 | 59 | 24 | 10 | 7 | 30 | 26 | 5 | IterImplSkip.java |
df_merged_statistics_results.columns
Index(['UserID', 'Name', 'Path', 'Time', 'Correctness', 'Module Position',
'Cognitive Complexity', 'McCC', 'TCLOC', 'LOC', 'TLLOC', 'NL',
'Path_y'],
dtype='object')
Column descriptions:
A check on the number of null data-points in the merged dataset is performed.
print(df_merged_statistics_results.isna().sum())
UserID                  0
Name                    0
Path                    0
Time                    0
Correctness             0
Module Position         0
Cognitive Complexity    0
McCC                    0
TCLOC                   0
LOC                     0
TLLOC                   0
NL                      0
Path_y                  0
dtype: int64
df_merged_statistics_results.to_csv('merged_experiment_results.csv',
index=False)
A brief correlation analysis is performed on the dataset as-is.
df_experiment_statistics = df_merged_statistics_results.copy()
df_experiment_statistics
| | UserID | Name | Path | Time | Correctness | Module Position | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL | Path_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 1 | 2267 | 9 | 9 | 17 | 30 | 22 | 2 | JSONObject.java |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 1 | 2565 | 23 | 12 | 27 | 54 | 53 | 5 | JSONObject.java |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 1 | 1231 | 11 | 13 | 12 | 36 | 30 | 2 | JSONObject.java |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 1 | 200 | 18 | 25 | 12 | 56 | 56 | 3 | XMLTokener.java |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 1 | 59 | 24 | 10 | 7 | 30 | 26 | 5 | IterImplSkip.java |
| 5 | 1 | int parse(JsonIterator iter) | IterImplString.java | 18 | 1 | 68 | 6 | 5 | 6 | 28 | 23 | 2 | IterImplString.java |
| 6 | 1 | Slice readSlice(JsonIterator iter) | IterImplForStreaming.java | 21 | 1 | 199 | 8 | 6 | 3 | 34 | 32 | 2 | IterImplForStreaming.java |
| 7 | 1 | Decoder createDecoder(String cacheKey, Type type) | extra/GsonCompatibilityMode.java | 24 | 1 | 334 | 46 | 8 | 0 | 26 | 112 | 7 | extra\GsonCompatibilityMode.java |
| 8 | 2 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 32 | 1 | 2565 | 23 | 12 | 27 | 54 | 53 | 5 | JSONObject.java |
| 9 | 2 | Object nextValue() | JSONTokener.java | 77 | 1 | 421 | 5 | 9 | 15 | 40 | 28 | 1 | JSONTokener.java |
| 10 | 2 | Object nextToken() | XMLTokener.java | 31 | 1 | 268 | 22 | 28 | 12 | 74 | 68 | 3 | XMLTokener.java |
| 11 | 2 | String unescape(String string) | XML.java | 20 | 1 | 199 | 8 | 4 | 11 | 23 | 19 | 3 | XML.java |
| 12 | 2 | String updateBindingSetOp(String rendered, Bin... | CodegenImplObjectStrict.java | 10 | 1 | 202 | 15 | 8 | 3 | 37 | 34 | 3 | CodegenImplObjectStrict.java |
| 13 | 2 | void writeFloat(JsonStream stream, float val) | output/StreamImplNumber.java | 44 | 1 | 213 | 10 | 9 | 1 | 35 | 35 | 2 | output\StreamImplNumber.java |
| 14 | 2 | Any fillCacheUntil(int target) | any/ArrayLazyAny.java | 26 | 1 | 158 | 12 | 10 | 0 | 43 | 43 | 3 | any\ArrayLazyAny.java |
| 15 | 2 | Decoder createDecoder(String cacheKey, Type type) | extra/GsonCompatibilityMode.java | 25 | 1 | 334 | 46 | 8 | 0 | 26 | 112 | 7 | extra\GsonCompatibilityMode.java |
| 16 | 3 | Object nextMeta() | XMLTokener.java | 38 | 1 | 200 | 18 | 25 | 12 | 56 | 56 | 3 | XMLTokener.java |
| 17 | 3 | Object nextValue() | JSONTokener.java | 30 | 1 | 421 | 5 | 9 | 15 | 40 | 28 | 1 | JSONTokener.java |
| 18 | 3 | Object wrap(Object object) | JSONObject.java | 13 | 1 | 2433 | 10 | 26 | 12 | 40 | 39 | 2 | JSONObject.java |
| 19 | 3 | void populateMap(Object bean) | JSONObject.java | 33 | 1 | 1522 | 36 | 17 | 13 | 40 | 33 | 7 | JSONObject.java |
| 20 | 3 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 29 | 1 | 59 | 24 | 10 | 7 | 30 | 26 | 5 | IterImplSkip.java |
| 21 | 3 | void skipArray(JsonIterator iter) | IterImplForStreaming.java | 14 | 1 | 61 | 12 | 8 | 5 | 29 | 27 | 4 | IterImplForStreaming.java |
| 22 | 3 | Any fillCacheUntil(int target) | any\ArrayLazyAny.java | 32 | 1 | 158 | 12 | 10 | 0 | 43 | 43 | 3 | any\ArrayLazyAny.java |
| 23 | 3 | void skipFixedBytes(JsonIterator iter, int n) | IterImplForStreaming.java | 16 | 1 | 362 | 6 | 4 | 0 | 14 | 14 | 3 | IterImplForStreaming.java |
| 24 | 4 | Decoder createDecoder(String cacheKey, Type type) | extra/GsonCompatibilityMode.java | 28 | 1 | 334 | 46 | 8 | 0 | 26 | 112 | 7 | extra\GsonCompatibilityMode.java |
| 25 | 5 | Object nextToken() | XMLTokener.java | 40 | 1 | 268 | 22 | 28 | 12 | 74 | 68 | 3 | XMLTokener.java |
| 26 | 1 | String toString(JSONArray ja) | JSONML.java | 23 | 1 | 394 | 21 | 10 | 10 | 70 | 59 | 6 | JSONML.java |
| 27 | 1 | String unescape(String string) | XML.java | 19 | 1 | 199 | 8 | 4 | 11 | 23 | 19 | 3 | XML.java |
| 28 | 1 | void populateMap(Object bean) | JSONObject.java | 41 | 1 | 1522 | 36 | 17 | 13 | 40 | 33 | 7 | JSONObject.java |
| 29 | 1 | JSONObject toJSONObject(String string) | Cookie.java | 21 | 1 | 89 | 12 | 8 | 28 | 44 | 33 | 2 | Cookie.java |
| 30 | 1 | void enableDecoders() | extra/Base64FloatSupport.java | 37 | 1 | 98 | 12 | 1 | 0 | 10 | 50 | 0 | extra\Base64FloatSupport.java |
| 31 | 1 | void writeRaw(String val, int remaining) | output/JsonStream.java | 16 | 1 | 154 | 5 | 4 | 0 | 25 | 25 | 2 | output\JsonStream.java |
| 32 | 1 | Any fillCacheUntil(int target) | any/ArrayLazyAny.java | 29 | 1 | 158 | 12 | 10 | 0 | 43 | 43 | 3 | any\ArrayLazyAny.java |
| 33 | 1 | Object read() | JsonIterator.java | 25 | 1 | 287 | 11 | 13 | 0 | 42 | 42 | 4 | JsonIterator.java |
| 34 | 2 | String toString(JSONArray ja) | JSONML.java | 27 | 1 | 394 | 21 | 10 | 10 | 70 | 59 | 6 | JSONML.java |
| 35 | 2 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 22 | 1 | 1231 | 11 | 13 | 12 | 36 | 30 | 2 | JSONObject.java |
| 36 | 2 | JSONArray(JSONTokener x) | JSONArray.java | 28 | 1 | 106 | 20 | 11 | 11 | 44 | 40 | 4 | JSONArray.java |
| 37 | 2 | float getFloat(int index) | JSONArray.java | 9 | 1 | 326 | 2 | 3 | 10 | 11 | 11 | 1 | JSONArray.java |
| 38 | 2 | Object read() | JsonIterator.java | 32 | 1 | 287 | 11 | 13 | 0 | 42 | 42 | 4 | JsonIterator.java |
| 39 | 2 | int parse(JsonIterator iter) | IterImplString.java | 22 | 1 | 68 | 6 | 5 | 6 | 28 | 23 | 2 | IterImplString.java |
| 40 | 2 | boolean equals(Object o) | extra/GsonCompatibilityMode.java | 19 | 1 | 171 | 5 | 15 | 0 | 21 | 19 | 1 | extra\GsonCompatibilityMode.java |
| 41 | 2 | void skipFixedBytes(JsonIterator iter, int n) | IterImplForStreaming.java | 16 | 1 | 362 | 6 | 4 | 0 | 14 | 14 | 3 | IterImplForStreaming.java |
| 42 | 3 | boolean similar(Object other) | JSONObject.java | 52 | 1 | 2085 | 23 | 15 | 8 | 39 | 39 | 6 | JSONObject.java |
| 43 | 3 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 38 | 1 | 2565 | 23 | 12 | 27 | 54 | 53 | 5 | JSONObject.java |
| 44 | 3 | JSONArray(JSONTokener x) | JSONArray.java | 25 | 1 | 106 | 20 | 11 | 11 | 44 | 40 | 4 | JSONArray.java |
| 45 | 3 | String unescape(String string) | XML.java | 24 | 1 | 199 | 8 | 4 | 11 | 23 | 19 | 3 | XML.java |
| 46 | 3 | Decoder createDecoder(String cacheKey, Type type) | extra\GsonCompatibilityMode.java | 32 | 1 | 334 | 46 | 8 | 0 | 26 | 112 | 7 | extra\GsonCompatibilityMode.java |
| 47 | 3 | void enocde_(Object obj, JsonStream stream) | output\ReflectionObjectEncoder.java | 40 | 1 | 65 | 21 | 10 | 0 | 42 | 42 | 4 | output\ReflectionObjectEncoder.java |
| 48 | 3 | void enableDecoders() | extra\Base64FloatSupport.java | 33 | 1 | 98 | 12 | 1 | 0 | 10 | 50 | 0 | extra\Base64FloatSupport.java |
| 49 | 3 | boolean equals(Object o) | extra\GsonCompatibilityMode.java | 17 | 1 | 171 | 5 | 15 | 0 | 21 | 19 | 1 | extra\GsonCompatibilityMode.java |
In order to compute the degree of correlation between the features of the dataset, the Kendall tau coefficient is computed.
df_experiment_statistics.corr(method='kendall')
| | Time | Correctness | Module Position | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL |
|---|---|---|---|---|---|---|---|---|---|
| Time | 1.000000 | NaN | 0.178781 | 0.298144 | 0.275667 | 0.182890 | 0.271094 | 0.307633 | 0.145350 |
| Correctness | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Module Position | 0.178781 | NaN | 1.000000 | 0.106443 | 0.235163 | 0.330922 | 0.146405 | 0.143698 | 0.165382 |
| Cognitive Complexity | 0.298144 | NaN | 0.106443 | 1.000000 | 0.242590 | 0.080933 | 0.324232 | 0.584417 | 0.702600 |
| McCC | 0.275667 | NaN | 0.235163 | 0.242590 | 1.000000 | 0.262127 | 0.568215 | 0.266956 | 0.182302 |
| TCLOC | 0.182890 | NaN | 0.330922 | 0.080933 | 0.262127 | 1.000000 | 0.369297 | -0.039485 | -0.007637 |
| LOC | 0.271094 | NaN | 0.146405 | 0.324232 | 0.568215 | 0.369297 | 1.000000 | 0.547067 | 0.246081 |
| TLLOC | 0.307633 | NaN | 0.143698 | 0.584417 | 0.266956 | -0.039485 | 0.547067 | 1.000000 | 0.372242 |
| NL | 0.145350 | NaN | 0.165382 | 0.702600 | 0.182302 | -0.007637 | 0.246081 | 0.372242 | 1.000000 |
Correctness shows NaN values in its correlation with every metric. Since it is correctly modeled as an integer variable, the remaining explanation for this behavior lies in the variation of its values.
df_experiment_statistics['Correctness'].unique()
array([1], dtype=int64)
Since all the values in Correctness equal 1, its correlation with the other variables cannot be computed: the standard deviation of Correctness equals 0, and it appears in the denominator of the formula that computes the correlation.
As a result, the column can be dropped.
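The zero-variance effect can be demonstrated directly on toy data:

```python
import numpy as np
import scipy.stats as stats

# A constant series has zero variance: the normalizing term in the
# denominator of the correlation coefficient is zero, so the result
# is undefined and scipy reports NaN
tau, p = stats.kendalltau([1, 1, 1, 1], [3, 1, 4, 2])
print(tau)
```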
df_experiment_statistics.drop(columns=['Correctness'],
inplace=True)
df_experiment_statistics.head()
| | UserID | Name | Path | Time | Module Position | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL | Path_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 2267 | 9 | 9 | 17 | 30 | 22 | 2 | JSONObject.java |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 2565 | 23 | 12 | 27 | 54 | 53 | 5 | JSONObject.java |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 1231 | 11 | 13 | 12 | 36 | 30 | 2 | JSONObject.java |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 200 | 18 | 25 | 12 | 56 | 56 | 3 | XMLTokener.java |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 59 | 24 | 10 | 7 | 30 | 26 | 5 | IterImplSkip.java |
df_experiment_statistics.corr(method='kendall')
| | Time | Module Position | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL |
|---|---|---|---|---|---|---|---|---|
| Time | 1.000000 | 0.178781 | 0.298144 | 0.275667 | 0.182890 | 0.271094 | 0.307633 | 0.145350 |
| Module Position | 0.178781 | 1.000000 | 0.106443 | 0.235163 | 0.330922 | 0.146405 | 0.143698 | 0.165382 |
| Cognitive Complexity | 0.298144 | 0.106443 | 1.000000 | 0.242590 | 0.080933 | 0.324232 | 0.584417 | 0.702600 |
| McCC | 0.275667 | 0.235163 | 0.242590 | 1.000000 | 0.262127 | 0.568215 | 0.266956 | 0.182302 |
| TCLOC | 0.182890 | 0.330922 | 0.080933 | 0.262127 | 1.000000 | 0.369297 | -0.039485 | -0.007637 |
| LOC | 0.271094 | 0.146405 | 0.324232 | 0.568215 | 0.369297 | 1.000000 | 0.547067 | 0.246081 |
| TLLOC | 0.307633 | 0.143698 | 0.584417 | 0.266956 | -0.039485 | 0.547067 | 1.000000 | 0.372242 |
| NL | 0.145350 | 0.165382 | 0.702600 | 0.182302 | -0.007637 | 0.246081 | 0.372242 | 1.000000 |
It is worth noting that Cognitive Complexity and McCC show a low degree of correlation (0.243), so it is interesting to investigate which of the two metrics correlates more strongly with Time.
LOC and TLLOC also appear quantitatively different (correlation of 0.547), so the same question applies to them. Finally, the analysis of TCLOC can provide insights into how useful comments are for understanding the source code and locating the defects.
Some columns are dropped because they are not considered in the correlation analysis.
df_experiment_statistics.drop(columns=['Module Position',
'Path_y'],
inplace=True)
df_experiment_statistics.head()
| | UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
Some helper functions are defined to compute the Kendall tau coefficient and the related p-value.
def print_kendall_correlation(df, feature_1, feature_2):
print(f'Correlation between {feature_1} and {feature_2}:')
print(stats.kendalltau(df[feature_1], df[feature_2]))
def print_all_kendall_correlations(df):
print_kendall_correlation(df, 'Cognitive Complexity', 'Time')
print()
print_kendall_correlation(df, 'McCC', 'Time')
print()
print_kendall_correlation(df, 'LOC', 'Time')
print()
print_kendall_correlation(df, 'TCLOC', 'Time')
print()
print_kendall_correlation(df, 'TLLOC', 'Time')
print()
print_kendall_correlation(df, 'NL', 'Time')
print_all_kendall_correlations(df_experiment_statistics)
Correlation between Cognitive Complexity and Time:
KendalltauResult(correlation=0.2981436069910109, pvalue=0.003169684615873828)

Correlation between McCC and Time:
KendalltauResult(correlation=0.2756671500231326, pvalue=0.006683927833596138)

Correlation between LOC and Time:
KendalltauResult(correlation=0.27109376363932897, pvalue=0.0067316173454244595)

Correlation between TCLOC and Time:
KendalltauResult(correlation=0.1828899735871646, pvalue=0.07880577563232574)

Correlation between TLLOC and Time:
KendalltauResult(correlation=0.3076331246861424, pvalue=0.002080338510739423)

Correlation between NL and Time:
KendalltauResult(correlation=0.14535043640755838, pvalue=0.16492818562000944)
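The repeated calls above can equivalently be written as a loop over the metric names; a self-contained sketch with synthetic data (the DataFrame and its values are illustrative, not the experiment's):

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in for df_experiment_statistics (column names only)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 50, size=(20, 3)),
                  columns=['McCC', 'LOC', 'Time'])

# Same output pattern as the repeated calls, written as a loop
for metric in ['McCC', 'LOC']:
    tau, p = stats.kendalltau(df[metric], df['Time'])
    print(f'Correlation between {metric} and Time: tau={tau:.3f}, p={p:.3f}')
```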
Some functions are defined to display the scatterplot of two features, together with the line of a linear regression fitted to the data.
def show_scatter_correlation(df, feature_1, feature_2):
fig = px.scatter(df,
x=feature_1,
y=feature_2,
trendline='ols')
fig.show()
def show_all_scatter_correlations(df):
show_scatter_correlation(df,
'Cognitive Complexity',
'Time')
show_scatter_correlation(df,
'McCC',
'Time')
show_scatter_correlation(df,
'LOC',
'Time')
show_scatter_correlation(df,
'TCLOC',
'Time')
show_scatter_correlation(df,
'TLLOC',
'Time')
show_scatter_correlation(df,
'NL',
'Time')
show_all_scatter_correlations(df_experiment_statistics)
A function is defined to display the histograms of the considered features.
def show_software_metrics_stats(df):
fig = px.histogram(df, x='Cognitive Complexity')
fig.show()
fig = px.histogram(df, x='McCC')
fig.show()
fig = px.histogram(df, x='LOC')
fig.show()
fig = px.histogram(df, x='TCLOC')
fig.show()
fig = px.histogram(df, x='TLLOC')
fig.show()
fig = px.histogram(df, x='NL')
fig.show()
show_software_metrics_stats(df_experiment_statistics)
This section analyzes the correlation between Time and Cognitive Complexity in more detail.
df_cognitive_complexity_time = df_experiment_statistics.copy()
df_cognitive_complexity_time.head()
| | UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
Only the considered metrics are selected.
df_cognitive_complexity_time = df_cognitive_complexity_time[['Cognitive Complexity',
'Time']]
df_cognitive_complexity_time.head()
| | Cognitive Complexity | Time |
|---|---|---|
| 0 | 9 | 49 |
| 1 | 23 | 26 |
| 2 | 11 | 35 |
| 3 | 18 | 27 |
| 4 | 24 | 22 |
The correlation coefficient obtained on the initial dataset is reproduced.
print_kendall_correlation(df_cognitive_complexity_time,
'Cognitive Complexity',
'Time')
Correlation between Cognitive Complexity and Time: KendalltauResult(correlation=0.2981436069910109, pvalue=0.003169684615873828)
A function is defined to display a boxplot of a single feature.
def show_feature_boxplot(df, feature):
fig = px.box(df, y=feature)
fig.show()
show_feature_boxplot(df_cognitive_complexity_time, 'Cognitive Complexity')
show_feature_boxplot(df_cognitive_complexity_time, 'Time')
Also, the scatterplot is replicated.
show_scatter_correlation(df_cognitive_complexity_time,
'Cognitive Complexity',
'Time')
A function is defined to compute the initial number of outliers. The z-score is used to detect the outliers, and the threshold is set to 3 in order to keep as many data points as possible.
threshold = 3
def print_outliers_based_on_zscore(df):
z = np.abs(stats.zscore(df))
print(f'Number of outliers: {np.where(z > threshold)[0].shape[0]}')
print_outliers_based_on_zscore(df_cognitive_complexity_time)
Number of outliers: 1
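For reference, stats.zscore simply standardizes each value; a minimal sketch on toy data (illustrative values only):

```python
import numpy as np
import scipy.stats as stats

# stats.zscore computes z = (x - mean) / std for each value,
# so a large |z| flags values far from the mean
data = np.array([20, 22, 25, 27, 30, 95], dtype=float)
z = np.abs(stats.zscore(data))
manual = np.abs((data - data.mean()) / data.std())
print(z.round(2))  # the largest z-score belongs to the extreme value 95
```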
A function for the iterative detection and removal of the outliers, based on the z-score, is defined.
def remove_outliers_zscore(orig_df, print_outliers):
    df = orig_df.copy()
    while True:
        z = np.abs(stats.zscore(df))
        max_zscore = np.amax(z)
        if max_zscore < threshold:
            break
        # Drop the row(s) holding the current maximum z-score, then recompute
        to_remove = np.where(z == max_zscore)[0]
        df.drop(df.index[to_remove],
                inplace=True)
    if print_outliers:
        print(f'Number of outliers removed: {orig_df.shape[0] - df.shape[0]}')
    return df
df_cognitive_complexity_time_without_outliers = \
remove_outliers_zscore(df_cognitive_complexity_time, print_outliers=True)
df_cognitive_complexity_time_without_outliers.info()
Number of outliers removed: 1
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 49
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Cognitive Complexity  49 non-null     int64
 1   Time                  49 non-null     int64
dtypes: int64(2)
memory usage: 1.1 KB
The correlation of the dataset without the outliers is shown.
print_kendall_correlation(df_cognitive_complexity_time_without_outliers,
'Cognitive Complexity',
'Time')
Correlation between Cognitive Complexity and Time: KendalltauResult(correlation=0.34857566985194194, pvalue=0.0006418563024944508)
show_scatter_correlation(df_cognitive_complexity_time_without_outliers,
'Cognitive Complexity',
'Time')
This section analyzes the correlation between Time and the McCabe Cyclomatic Complexity in more detail.
The same analysis as in the previous section is repeated.
df_cyclomatic_complexity_time = df_experiment_statistics.copy()
df_cyclomatic_complexity_time.head()
| | UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
df_cyclomatic_complexity_time = df_cyclomatic_complexity_time[['McCC',
'Time']]
df_cyclomatic_complexity_time.head()
| | McCC | Time |
|---|---|---|
| 0 | 9 | 49 |
| 1 | 12 | 26 |
| 2 | 13 | 35 |
| 3 | 25 | 27 |
| 4 | 10 | 22 |
print_kendall_correlation(df_cyclomatic_complexity_time,
'McCC',
'Time')
Correlation between McCC and Time: KendalltauResult(correlation=0.2756671500231326, pvalue=0.006683927833596138)
show_feature_boxplot(df_cyclomatic_complexity_time, 'McCC')
show_feature_boxplot(df_cyclomatic_complexity_time, 'Time')
show_scatter_correlation(df_cyclomatic_complexity_time,
'McCC',
'Time')
print_outliers_based_on_zscore(df_cyclomatic_complexity_time)
Number of outliers: 1
df_cyclomatic_complexity_time_without_outliers = \
remove_outliers_zscore(df_cyclomatic_complexity_time, print_outliers=True)
df_cyclomatic_complexity_time_without_outliers.info()
Number of outliers removed: 1
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 49
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   McCC    49 non-null     int64
 1   Time    49 non-null     int64
dtypes: int64(2)
memory usage: 1.1 KB
print_kendall_correlation(df_cyclomatic_complexity_time_without_outliers,
'McCC',
'Time')
Correlation between McCC and Time: KendalltauResult(correlation=0.29448549124756274, pvalue=0.004170078229891685)
show_scatter_correlation(df_cyclomatic_complexity_time_without_outliers,
'McCC',
'Time')
This section analyzes the correlation between Time and the Lines of Code in more detail.
The same analysis as in the previous section is repeated.
df_loc_time = df_experiment_statistics.copy()
df_loc_time.head()
| | UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
df_loc_time = df_loc_time[['LOC', 'Time']]
df_loc_time.head()
| | LOC | Time |
|---|---|---|
| 0 | 30 | 49 |
| 1 | 54 | 26 |
| 2 | 36 | 35 |
| 3 | 56 | 27 |
| 4 | 30 | 22 |
print_kendall_correlation(df_loc_time,
'LOC',
'Time')
Correlation between LOC and Time: KendalltauResult(correlation=0.27109376363932897, pvalue=0.0067316173454244595)
show_feature_boxplot(df_loc_time, 'LOC')
show_feature_boxplot(df_loc_time, 'Time')
show_scatter_correlation(df_loc_time,
'LOC',
'Time')
print_outliers_based_on_zscore(df_loc_time)
Number of outliers: 1
df_loc_time_without_outliers = \
remove_outliers_zscore(df_loc_time, print_outliers=True)
df_loc_time_without_outliers.info()
Number of outliers removed: 1
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 49
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   LOC     49 non-null     int64
 1   Time    49 non-null     int64
dtypes: int64(2)
memory usage: 1.1 KB
print_kendall_correlation(df_loc_time_without_outliers,
'LOC',
'Time')
Correlation between LOC and Time: KendalltauResult(correlation=0.2743600385627248, pvalue=0.006641156471904747)
show_scatter_correlation(df_loc_time_without_outliers,
'LOC',
'Time')
This section analyzes the correlation between Time and the Total Comment Lines of Code in more detail.
The same analysis as in the previous section is repeated.
df_total_comments_loc_time = df_experiment_statistics.copy()
df_total_comments_loc_time.head()
| | UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
df_total_comments_loc_time = df_total_comments_loc_time[['TCLOC', 'Time']]
df_total_comments_loc_time.head()
| | TCLOC | Time |
|---|---|---|
| 0 | 17 | 49 |
| 1 | 27 | 26 |
| 2 | 12 | 35 |
| 3 | 12 | 27 |
| 4 | 7 | 22 |
print_kendall_correlation(df_total_comments_loc_time,
'TCLOC',
'Time')
Correlation between TCLOC and Time: KendalltauResult(correlation=0.1828899735871646, pvalue=0.07880577563232574)
show_feature_boxplot(df_total_comments_loc_time, 'TCLOC')
show_feature_boxplot(df_total_comments_loc_time, 'Time')
show_scatter_correlation(df_total_comments_loc_time,
'TCLOC',
'Time')
print_outliers_based_on_zscore(df_total_comments_loc_time)
Number of outliers: 1
df_total_comments_loc_time_without_outliers = \
remove_outliers_zscore(df_total_comments_loc_time, print_outliers=True)
df_total_comments_loc_time_without_outliers.info()
Number of outliers removed: 1
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 49
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   TCLOC   49 non-null     int64
 1   Time    49 non-null     int64
dtypes: int64(2)
memory usage: 1.1 KB
print_kendall_correlation(df_total_comments_loc_time_without_outliers,
'TCLOC',
'Time')
Correlation between TCLOC and Time: KendalltauResult(correlation=0.15573852788470482, pvalue=0.1394762676101472)
show_scatter_correlation(df_total_comments_loc_time_without_outliers,
'TCLOC',
'Time')
This section analyzes the correlation between Time and the Total Logical Lines of Code in more detail.
The same analysis as in the previous section is performed.
df_total_logical_loc_time = df_experiment_statistics.copy()
df_total_logical_loc_time.head()
| UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
df_total_logical_loc_time = df_total_logical_loc_time[['TLLOC', 'Time']]
df_total_logical_loc_time.head()
| TLLOC | Time | |
|---|---|---|
| 0 | 22 | 49 |
| 1 | 53 | 26 |
| 2 | 30 | 35 |
| 3 | 56 | 27 |
| 4 | 26 | 22 |
print_kendall_correlation(df_total_logical_loc_time,
'TLLOC',
'Time')
Correlation between TLLOC and Time: KendalltauResult(correlation=0.3076331246861424, pvalue=0.002080338510739423)
show_feature_boxplot(df_total_logical_loc_time, 'TLLOC')
show_feature_boxplot(df_total_logical_loc_time, 'Time')
show_scatter_correlation(df_total_logical_loc_time,
'TLLOC',
'Time')
print_outliers_based_on_zscore(df_total_logical_loc_time)
Number of outliers: 1
df_total_logical_loc_time_without_outliers = \
remove_outliers_zscore(df_total_logical_loc_time, print_outliers=True)
df_total_logical_loc_time_without_outliers.info()
Number of outliers removed: 1 <class 'pandas.core.frame.DataFrame'> Int64Index: 49 entries, 0 to 49 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 TLLOC 49 non-null int64 1 Time 49 non-null int64 dtypes: int64(2) memory usage: 1.1 KB
print_kendall_correlation(df_total_logical_loc_time_without_outliers,
'TLLOC',
'Time')
Correlation between TLLOC and Time: KendalltauResult(correlation=0.33639686256894596, pvalue=0.0008734586529081415)
show_scatter_correlation(df_total_logical_loc_time_without_outliers,
'TLLOC',
'Time')
This section analyzes the correlation between Time and the Nesting Level in more detail.
The same analysis as in the previous section is performed.
df_nesting_level_time = df_experiment_statistics.copy()
df_nesting_level_time.head()
| UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
df_nesting_level_time = df_nesting_level_time[['NL', 'Time']]
df_nesting_level_time.head()
| NL | Time | |
|---|---|---|
| 0 | 2 | 49 |
| 1 | 5 | 26 |
| 2 | 2 | 35 |
| 3 | 3 | 27 |
| 4 | 5 | 22 |
print_kendall_correlation(df_nesting_level_time,
'NL',
'Time')
Correlation between NL and Time: KendalltauResult(correlation=0.14535043640755838, pvalue=0.16492818562000944)
show_feature_boxplot(df_nesting_level_time, 'NL')
show_feature_boxplot(df_nesting_level_time, 'Time')
show_scatter_correlation(df_nesting_level_time,
'NL',
'Time')
print_outliers_based_on_zscore(df_nesting_level_time)
Number of outliers: 1
df_nesting_level_time_without_outliers = \
remove_outliers_zscore(df_nesting_level_time, print_outliers=True)
df_nesting_level_time_without_outliers.info()
Number of outliers removed: 1 <class 'pandas.core.frame.DataFrame'> Int64Index: 49 entries, 0 to 49 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NL 49 non-null int64 1 Time 49 non-null int64 dtypes: int64(2) memory usage: 1.1 KB
print_kendall_correlation(df_nesting_level_time_without_outliers,
'NL',
'Time')
Correlation between NL and Time: KendalltauResult(correlation=0.18986374332250777, pvalue=0.07305630855383179)
show_scatter_correlation(df_nesting_level_time_without_outliers,
'NL',
'Time')
Cognitive Complexity has shown the most significant correlation with Time, starting from 0.298 before outlier removal and reaching 0.349 after it (1 outlier was removed from the 50 initial data points).
Despite the small amount of data available, there seems to be a slight degree of correlation between these two variables.
Considering the other features, TLLOC has shown an initial correlation of 0.308 and a correlation of 0.336 after outlier removal. McCC has shown an initial correlation of 0.276 and a correlation of 0.294 after outlier removal. LOC has shown an initial correlation of 0.271 and a correlation of 0.274 after outlier removal. NL has shown an initial correlation of 0.145 and a correlation of 0.190 after outlier removal. TCLOC is the only variable whose correlation with Time worsened, dropping from an initial 0.183 to 0.156 after outlier removal. In all these cases, exactly 1 outlier out of the 50 initial data points was detected and removed.
Thus, TCLOC and NL have shown a very low degree of correlation with the time required to solve the tasks. However, TLLOC has shown a correlation degree very similar to the one reached by Cognitive Complexity. Finally, McCC and LOC have shown a slightly weaker correlation with Time than Cognitive Complexity.
def get_correlation_info(df, feature):
return stats.kendalltau(df[feature], df['Time'])
def get_initial_final_correlation_info(df_initial, df_final, feature):
initial_tau, initial_p_value = get_correlation_info(df_initial, feature)
final_tau, final_p_value = get_correlation_info(df_final, feature)
return [feature, round(initial_tau, 3), round(initial_p_value, 3),
round(final_tau, 3), round(final_p_value, 3)]
def compute_correlation_info_for_all_features():
features_list = ['Cognitive Complexity', 'McCC', 'LOC', 'TCLOC', 'TLLOC', 'NL']
initial_dfs_list = [df_cognitive_complexity_time, df_cyclomatic_complexity_time,
df_loc_time, df_total_comments_loc_time,
df_total_logical_loc_time, df_nesting_level_time]
final_dfs_list = [df_cognitive_complexity_time_without_outliers,
df_cyclomatic_complexity_time_without_outliers,
df_loc_time_without_outliers,
df_total_comments_loc_time_without_outliers,
df_total_logical_loc_time_without_outliers,
df_nesting_level_time_without_outliers]
features_correlation_info = []
for i in range(len(features_list)):
feature_correlation_info = get_initial_final_correlation_info(
initial_dfs_list[i], final_dfs_list[i], features_list[i])
features_correlation_info.append(feature_correlation_info)
return pd.DataFrame(features_correlation_info,
columns=['Feature', 'Initial_tau', 'Initial_p_value',
'Final_tau', 'Final_p_value'])\
.set_index('Feature', drop=True)
correlation_analysis_results = compute_correlation_info_for_all_features()
correlation_analysis_results
| Initial_tau | Initial_p_value | Final_tau | Final_p_value | |
|---|---|---|---|---|
| Feature | ||||
| Cognitive Complexity | 0.298 | 0.003 | 0.349 | 0.001 |
| McCC | 0.276 | 0.007 | 0.294 | 0.004 |
| LOC | 0.271 | 0.007 | 0.274 | 0.007 |
| TCLOC | 0.183 | 0.079 | 0.156 | 0.139 |
| TLLOC | 0.308 | 0.002 | 0.336 | 0.001 |
| NL | 0.145 | 0.165 | 0.190 | 0.073 |
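For reference, the tau and p-values summarized above come from `scipy.stats.kendalltau`, which the `print_kendall_correlation` helper wraps. A minimal illustration on hypothetical values (not the experiment's data):

```python
import scipy.stats as stats

# Hypothetical metric values and completion times, for illustration only.
complexity = [9, 23, 11, 18, 24, 30, 7, 15]
time_taken = [49, 26, 35, 27, 22, 55, 20, 31]

# Kendall's tau-b measures the strength of a monotonic association;
# it lies in [-1, 1], with values near 0 indicating a weak association.
tau, p_value = stats.kendalltau(complexity, time_taken)
```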
This section analyzes the possibility of creating a univariate regression model to predict the time needed to solve a task.
A function is defined to create and evaluate a regression model.
def generate_evaluate_univariate_model(df_train, df_test):
x_train = sm.add_constant(df_train.drop(columns=['Time']))
model = sm.OLS(df_train['Time'],
x_train).fit()
x_test = sm.add_constant(df_test.drop(columns=['Time']))
predictions = model.predict(x_test)
return [model.rsquared,
model.pvalues.iloc[-1],
mse(df_test['Time'], predictions),
rmse(df_test['Time'], predictions)]
Also, a function is defined to average the scores across the cross-validation runs.
def compute_scores_univariate_model(scores_list):
df_scores = pd.DataFrame(scores_list,
columns=['r2_score', 'p_value', 'MSE', 'RMSE'])
return [df_scores['r2_score'].mean(),
df_scores['p_value'].mean(),
df_scores['MSE'].mean(),
df_scores['RMSE'].mean()]
In order to evaluate the performance of the models, a function that applies 10-times repeated 10-fold cross-validation is defined. Specifically, this function produces 100 train sets and 100 test sets. Then, for each train/test pair, the function builds and fits a linear regression model on the train set with its outliers removed, computes the model's predictions, and finally evaluates its performance on all the data in the test set, outliers included.
def perform_10_times_10_fold_cross_validation_univariate_linear_regression(df):
scores_list = []
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42)
for train, test in cv.split(df):
df_train = df.iloc[train]
df_test = df.iloc[test]
df_train_without_outliers = remove_outliers_zscore(df_train,
print_outliers=False)
model_scores = \
generate_evaluate_univariate_model(df_train_without_outliers,
df_test)
scores_list.append(model_scores)
return compute_scores_univariate_model(scores_list)
Also, a function to print the scores of a model is defined.
def print_univariate_model_performances(scores_list):
print('R2 score:', scores_list[0])
print('p_value:', scores_list[1])
print('Mean Squared Error:', scores_list[2])
print('Root Mean Squared Error:', scores_list[3])
This section analyzes the possibility of creating a statistically significant regression model to predict the time needed to solve a task using only Cognitive Complexity.
cognitive_complexity_time_scores = \
perform_10_times_10_fold_cross_validation_univariate_linear_regression(
df_cognitive_complexity_time)
print_univariate_model_performances(cognitive_complexity_time_scores)
R2 score: 0.10682569072336559 p_value: 0.037553402083807724 Mean Squared Error: 143.06059339437599 Root Mean Squared Error: 10.616322739150224
This section analyzes the possibility of creating a statistically significant regression model to predict the time needed to solve a task using only McCabe Cyclomatic Complexity.
cyclomatic_complexity_time_scores = \
perform_10_times_10_fold_cross_validation_univariate_linear_regression(
df_cyclomatic_complexity_time)
print_univariate_model_performances(cyclomatic_complexity_time_scores)
R2 score: 0.08805162467875212 p_value: 0.06684742062939121 Mean Squared Error: 142.1505107900848 Root Mean Squared Error: 10.571364941057707
This section analyzes the possibility of creating a statistically significant regression model to predict the time needed to solve a task using only Lines of Code.
loc_time_scores = \
perform_10_times_10_fold_cross_validation_univariate_linear_regression(
df_loc_time)
print_univariate_model_performances(loc_time_scores)
R2 score: 0.10744672059389221 p_value: 0.03784595562209539 Mean Squared Error: 133.81360564184075 Root Mean Squared Error: 10.271586494345065
This section analyzes the possibility of creating a statistically significant regression model to predict the time needed to solve a task using only Total Comments Lines of Code.
total_comments_loc_time_scores = \
perform_10_times_10_fold_cross_validation_univariate_linear_regression(
df_total_comments_loc_time)
print_univariate_model_performances(total_comments_loc_time_scores)
R2 score: 0.029802343943847585 p_value: 0.29191432908940984 Mean Squared Error: 140.08605742827305 Root Mean Squared Error: 10.739959481454141
This section analyzes the possibility of creating a statistically significant regression model to predict the time needed to solve a task using only Total Logical Lines of Code.
total_logical_loc_time_scores = \
perform_10_times_10_fold_cross_validation_univariate_linear_regression(
df_total_logical_loc_time)
print_univariate_model_performances(total_logical_loc_time_scores)
R2 score: 0.08284666608851066 p_value: 0.09231864634142856 Mean Squared Error: 155.89296911004368 Root Mean Squared Error: 11.228236618989685
This section analyzes the possibility of creating a statistically significant regression model to predict the time needed to solve a task using only Nesting Level.
nesting_level_time_scores = \
perform_10_times_10_fold_cross_validation_univariate_linear_regression(
df_nesting_level_time)
print_univariate_model_performances(nesting_level_time_scores)
R2 score: 0.05044005510802533 p_value: 0.17082854225117847 Mean Squared Error: 146.8332826388065 Root Mean Squared Error: 10.80143289561366
Using Cognitive Complexity as the only predictor, the regression model shows an R2 score of 0.107 and a p-value of 0.038. The RMSE equals 10.616.
Even though the p-value suggests that the model has some statistical significance, the R2 score is very small.
Considering all the other metrics, LOC shows an R2 score of 0.107 and a p-value of 0.038. McCC shows an R2 score of 0.088 and a p-value of 0.067. TLLOC shows an R2 score of 0.083 with a p-value of 0.092. NL shows an R2 score of 0.050 and a p-value of 0.171. TCLOC shows an R2 score of 0.030 with a p-value of 0.292. In all of these cases, the RMSE lies between 10.27 and 10.8, except for TLLOC, where the RMSE equals 11.228.
Thus, the model built using LOC shows the same performance as the one built using Cognitive Complexity. The models built using either McCC or TLLOC perform slightly worse than the model built using Cognitive Complexity; moreover, in both these cases, the p-value is greater than the 0.05 threshold. Finally, the models built using either NL or TCLOC show the worst performance, with p-values far greater than 0.05.
def show_performances_univariate_models_for_all_features():
features_list = ['Cognitive Complexity', 'McCC', 'LOC', 'TCLOC', 'TLLOC', 'NL']
regression_scores_list = [cognitive_complexity_time_scores,
cyclomatic_complexity_time_scores,
loc_time_scores,
total_comments_loc_time_scores,
total_logical_loc_time_scores,
nesting_level_time_scores]
model_performances_info = []
for i in range(len(features_list)):
model_performances = [features_list[i]]
model_performances += list(map(lambda x: round(x, 3), regression_scores_list[i]))
model_performances_info.append(model_performances)
return pd.DataFrame(model_performances_info,
columns=['Feature', 'R2 Score', 'P-value', 'MSE', 'RMSE'])\
.set_index('Feature', drop=True)
univariate_analysis_results = \
show_performances_univariate_models_for_all_features()
univariate_analysis_results.sort_values(by=['R2 Score'], ascending=False)
| R2 Score | P-value | MSE | RMSE | |
|---|---|---|---|---|
| Feature | ||||
| Cognitive Complexity | 0.107 | 0.038 | 143.061 | 10.616 |
| LOC | 0.107 | 0.038 | 133.814 | 10.272 |
| McCC | 0.088 | 0.067 | 142.151 | 10.571 |
| TLLOC | 0.083 | 0.092 | 155.893 | 11.228 |
| NL | 0.050 | 0.171 | 146.833 | 10.801 |
| TCLOC | 0.030 | 0.292 | 140.086 | 10.740 |
This section analyzes the possibility of creating a multivariate regression model to predict the time needed to solve a task.
df_multivariate_analysis = df_experiment_statistics.copy()
df_multivariate_analysis.head()
| UserID | Name | Path | Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Object stringToValue(String string) | JSONObject.java | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 1 | Writer write(Writer writer, int indentFactor, ... | JSONObject.java | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 1 | BigInteger objectToBigInteger(Object val, BigI... | JSONObject.java | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 1 | Object nextMeta() | XMLTokener.java | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 1 | int findStringEnd(JsonIterator iter) | IterImplSkip.java | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
Some columns that are not needed for the analysis are dropped.
df_multivariate_analysis.drop(columns=['UserID',
'Name',
'Path'],
inplace=True)
df_multivariate_analysis.head()
| Time | Cognitive Complexity | McCC | TCLOC | LOC | TLLOC | NL | |
|---|---|---|---|---|---|---|---|
| 0 | 49 | 9 | 9 | 17 | 30 | 22 | 2 |
| 1 | 26 | 23 | 12 | 27 | 54 | 53 | 5 |
| 2 | 35 | 11 | 13 | 12 | 36 | 30 | 2 |
| 3 | 27 | 18 | 25 | 12 | 56 | 56 | 3 |
| 4 | 22 | 24 | 10 | 7 | 30 | 26 | 5 |
As in the univariate analysis, some functions are defined to build, fit, and evaluate the multivariate models.
def generate_evaluate_multivariate_model(df_train, df_test):
x_train = sm.add_constant(df_train.drop(columns=['Time']))
model = sm.OLS(df_train['Time'],
x_train).fit()
x_test = sm.add_constant(df_test.drop(columns=['Time']))
predictions = model.predict(x_test)
return [model.rsquared,
model.pvalues.iloc[-2],
model.pvalues.iloc[-1],
mse(df_test['Time'], predictions),
rmse(df_test['Time'], predictions)]
def compute_scores_multivariate_model(scores_list):
df_scores = pd.DataFrame(scores_list,
columns=['r2_score',
'p_value_1',
'p_value_2',
'MSE',
'RMSE'])
return [df_scores['r2_score'].mean(),
df_scores['p_value_1'].mean(),
df_scores['p_value_2'].mean(),
df_scores['MSE'].mean(),
df_scores['RMSE'].mean()]
def perform_10_times_10_fold_cross_validation_multivariate_linear_regression(df):
scores_list = []
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42)
for train, test in cv.split(df):
df_train = df.iloc[train]
df_test = df.iloc[test]
df_train_without_outliers = remove_outliers_zscore(df_train,
print_outliers=False)
model_scores = \
generate_evaluate_multivariate_model(df_train_without_outliers,
df_test)
scores_list.append(model_scores)
return compute_scores_multivariate_model(scores_list)
def print_multivariate_model_performances(scores_list, feature_1, feature_2):
print('R2 score:', scores_list[0])
print(f'p_value_{feature_1}:', scores_list[1])
print(f'p_value_{feature_2}:', scores_list[2])
print('Mean Squared Error:', scores_list[3])
print('Root Mean Squared Error:', scores_list[4])
This section analyzes the possibility of creating a statistically significant regression model using pairs of predictors, each composed of Cognitive Complexity and one of the other features.
df_cognitive_complexity_McCC_time = df_multivariate_analysis[[
'Cognitive Complexity', 'McCC', 'Time']]
df_cognitive_complexity_McCC_time.head()
| Cognitive Complexity | McCC | Time | |
|---|---|---|---|
| 0 | 9 | 9 | 49 |
| 1 | 23 | 12 | 26 |
| 2 | 11 | 13 | 35 |
| 3 | 18 | 25 | 27 |
| 4 | 24 | 10 | 22 |
cognitive_complexity_McCC_time_scores = \
perform_10_times_10_fold_cross_validation_multivariate_linear_regression(
df_cognitive_complexity_McCC_time)
print_multivariate_model_performances(cognitive_complexity_McCC_time_scores,
'Cognitive_Complexity',
'McCC')
R2 score: 0.16026363998641757 p_value_Cognitive_Complexity: 0.08056504972556645 p_value_McCC: 0.14109550646626196 Mean Squared Error: 142.81004422929837 Root Mean Squared Error: 10.485280796879854
df_cognitive_complexity_loc_time = df_multivariate_analysis[[
'Cognitive Complexity', 'LOC', 'Time']]
df_cognitive_complexity_loc_time.head()
| Cognitive Complexity | LOC | Time | |
|---|---|---|---|
| 0 | 9 | 30 | 49 |
| 1 | 23 | 54 | 26 |
| 2 | 11 | 36 | 35 |
| 3 | 18 | 56 | 27 |
| 4 | 24 | 30 | 22 |
cognitive_complexity_loc_time_scores = \
perform_10_times_10_fold_cross_validation_multivariate_linear_regression(
df_cognitive_complexity_loc_time)
print_multivariate_model_performances(cognitive_complexity_loc_time_scores,
'Cognitive_Complexity',
'LOC')
R2 score: 0.1753533645108051 p_value_Cognitive_Complexity: 0.0826563374944404 p_value_LOC: 0.09316779962236105 Mean Squared Error: 136.0963037277761 Root Mean Squared Error: 10.22009346745336
df_cognitive_complexity_tcloc_time = df_multivariate_analysis[[
'Cognitive Complexity', 'TCLOC', 'Time']]
df_cognitive_complexity_tcloc_time.head()
| Cognitive Complexity | TCLOC | Time | |
|---|---|---|---|
| 0 | 9 | 17 | 49 |
| 1 | 23 | 27 | 26 |
| 2 | 11 | 12 | 35 |
| 3 | 18 | 12 | 27 |
| 4 | 24 | 7 | 22 |
cognitive_complexity_tcloc_time_scores = \
perform_10_times_10_fold_cross_validation_multivariate_linear_regression(
df_cognitive_complexity_tcloc_time)
print_multivariate_model_performances(cognitive_complexity_tcloc_time_scores,
'Cognitive_Complexity',
'TCLOC')
R2 score: 0.1333110875768662 p_value_Cognitive_Complexity: 0.03912263359803965 p_value_TCLOC: 0.3031086620349795 Mean Squared Error: 140.93763856884883 Root Mean Squared Error: 10.610298922424981
df_cognitive_complexity_tlloc_time = df_multivariate_analysis[[
'Cognitive Complexity', 'TLLOC', 'Time']]
df_cognitive_complexity_tlloc_time.head()
| Cognitive Complexity | TLLOC | Time | |
|---|---|---|---|
| 0 | 9 | 22 | 49 |
| 1 | 23 | 53 | 26 |
| 2 | 11 | 30 | 35 |
| 3 | 18 | 56 | 27 |
| 4 | 24 | 26 | 22 |
cognitive_complexity_tlloc_time_scores = \
perform_10_times_10_fold_cross_validation_multivariate_linear_regression(
df_cognitive_complexity_tlloc_time)
print_multivariate_model_performances(cognitive_complexity_tlloc_time_scores,
'Cognitive_Complexity',
'TLLOC')
R2 score: 0.12730993614160172 p_value_Cognitive_Complexity: 0.19743607504694335 p_value_TLLOC: 0.728292246255959 Mean Squared Error: 164.29385029006033 Root Mean Squared Error: 11.48432232397964
df_cognitive_complexity_nl_time = df_multivariate_analysis[[
'Cognitive Complexity', 'NL', 'Time']]
df_cognitive_complexity_nl_time.head()
| Cognitive Complexity | NL | Time | |
|---|---|---|---|
| 0 | 9 | 2 | 49 |
| 1 | 23 | 5 | 26 |
| 2 | 11 | 2 | 35 |
| 3 | 18 | 3 | 27 |
| 4 | 24 | 5 | 22 |
cognitive_complexity_nl_time_scores = \
perform_10_times_10_fold_cross_validation_multivariate_linear_regression(
df_cognitive_complexity_nl_time)
print_multivariate_model_performances(cognitive_complexity_nl_time_scores,
'Cognitive_Complexity',
'NL')
R2 score: 0.12220290864252044 p_value_Cognitive_Complexity: 0.09575584539949519 p_value_NL: 0.47570555820059307 Mean Squared Error: 145.06814229284194 Root Mean Squared Error: 10.716259314340197
All the regression models that can be built with pairs of predictors in which one of the variables is Cognitive Complexity have been built and evaluated.
The model built with the pair Cognitive Complexity+LOC shows the best R2 score (0.175), with p-values of 0.083 for Cognitive Complexity and 0.093 for LOC. The model built with Cognitive Complexity+McCC shows an R2 score of 0.160, with p-values of 0.081 for Cognitive Complexity and 0.141 for McCC. The model built with Cognitive Complexity+TCLOC shows an R2 score of 0.133, with p-values of 0.039 for Cognitive Complexity and 0.303 for TCLOC. The model built with Cognitive Complexity+TLLOC shows an R2 score of 0.127, with p-values of 0.197 for Cognitive Complexity and 0.728 for TLLOC. The model built with Cognitive Complexity+NL shows an R2 score of 0.122, with p-values of 0.096 for Cognitive Complexity and 0.476 for NL. As in the univariate analysis, the RMSE lies between 10.22 and 10.72 for all the considered models, except for Cognitive Complexity+TLLOC, where the RMSE equals 11.48.
Thus, considering both the R2 score and the p-values, the model built using Cognitive Complexity and LOC shows the best performance, even though its p-values are slightly greater than 0.05 and its R2 score is small. The model built using Cognitive Complexity and McCC performs slightly worse than the best model, and its p-values also fall outside the acceptable range. The other three models perform even worse, so they cannot be considered.
def show_performances_multivariate_models_for_cognitive_complexity_couples():
features_list = ['Cognitive Complexity', 'McCC', 'LOC', 'TCLOC', 'TLLOC', 'NL']
regression_scores_list = [cognitive_complexity_McCC_time_scores,
cognitive_complexity_loc_time_scores,
cognitive_complexity_tcloc_time_scores,
cognitive_complexity_tlloc_time_scores,
cognitive_complexity_nl_time_scores]
model_performances_info = []
for i in range(len(features_list)-1):
model_performances = [f'Cognitive_Complexity+{features_list[i+1]}']
model_performances += list(map(lambda x: round(x, 3), regression_scores_list[i]))
model_performances_info.append(model_performances)
return pd.DataFrame(model_performances_info,
columns=['Features', 'R2 Score', 'P-value Feature 1',
'P-value Feature 2', 'MSE', 'RMSE'])\
.set_index('Features', drop=True)
cognitive_complexity_multivariate_analysis_results = \
show_performances_multivariate_models_for_cognitive_complexity_couples()
cognitive_complexity_multivariate_analysis_results
| R2 Score | P-value Feature 1 | P-value Feature 2 | MSE | RMSE | |
|---|---|---|---|---|---|
| Features | |||||
| Cognitive_Complexity+McCC | 0.160 | 0.081 | 0.141 | 142.810 | 10.485 |
| Cognitive_Complexity+LOC | 0.175 | 0.083 | 0.093 | 136.096 | 10.220 |
| Cognitive_Complexity+TCLOC | 0.133 | 0.039 | 0.303 | 140.938 | 10.610 |
| Cognitive_Complexity+TLLOC | 0.127 | 0.197 | 0.728 | 164.294 | 11.484 |
| Cognitive_Complexity+NL | 0.122 | 0.096 | 0.476 | 145.068 | 10.716 |
This section analyzes the possibility of creating a statistically significant regression model without Cognitive Complexity, i.e., using pairs of predictors neither of which is Cognitive Complexity.
def generate_evaluate_multivariate_models_other_couples():
scores_list = []
for couple in itertools.combinations(['McCC', 'LOC', 'TCLOC', 'TLLOC', 'NL'], 2):
df = df_multivariate_analysis[[couple[0], couple[1], 'Time']]
scores = \
perform_10_times_10_fold_cross_validation_multivariate_linear_regression(df)
scores_list.append(scores)
print_multivariate_model_performances(scores, couple[0], couple[1])
print()
return scores_list
multivariate_models_other_couples_scores = \
generate_evaluate_multivariate_models_other_couples()
R2 score: 0.11982943863430437 p_value_McCC: 0.5557225689303581 p_value_LOC: 0.28375165108483574 Mean Squared Error: 141.28962446253396 Root Mean Squared Error: 10.566883825865602
R2 score: 0.09761616992240812 p_value_McCC: 0.11449373103880447 p_value_TCLOC: 0.5623470810683754 Mean Squared Error: 143.236018163207 Root Mean Squared Error: 10.693331274584423
R2 score: 0.13942833990971337 p_value_McCC: 0.19786135746149558 p_value_TLLOC: 0.16786419674653966 Mean Squared Error: 158.01147932803235 Root Mean Squared Error: 11.206735933123037
R2 score: 0.12113286727527196 p_value_McCC: 0.09638767404499089 p_value_NL: 0.26807557705508656 Mean Squared Error: 146.63451245250909 Root Mean Squared Error: 10.640769802717516
R2 score: 0.10948995868242252 p_value_LOC: 0.07867725157718819 p_value_TCLOC: 0.8094330326678028 Mean Squared Error: 137.2309025490942 Root Mean Squared Error: 10.453551572897066
R2 score: 0.14811851964202108 p_value_LOC: 0.1486745721805737 p_value_TLLOC: 0.22264570019259458 Mean Squared Error: 160.73341030616513 Root Mean Squared Error: 11.268564267642365
R2 score: 0.12565480673694363 p_value_LOC: 0.08358420208461671 p_value_NL: 0.41529832112296783 Mean Squared Error: 140.26004237515988 Root Mean Squared Error: 10.455747283519015
R2 score: 0.12011243590748252 p_value_TCLOC: 0.25397504455204223 p_value_TLLOC: 0.06703239657171477 Mean Squared Error: 151.67896851633185 Root Mean Squared Error: 11.147679075554874
R2 score: 0.0743011683236761 p_value_TCLOC: 0.3409665216447894 p_value_NL: 0.20016468539352686 Mean Squared Error: 145.25561197406077 Root Mean Squared Error: 10.81962163722644
R2 score: 0.09451296348046742 p_value_TLLOC: 0.2870556032111785 p_value_NL: 0.5307579289827014 Mean Squared Error: 163.9821330620776 Root Mean Squared Error: 11.520824084441141
Looking at the performance of all the models built with pairs of predictors that exclude Cognitive Complexity, none of the generated models shows an acceptable degree of statistical significance.
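The ten pairs evaluated above are generated by `itertools.combinations`, which enumerates unordered pairs without repetition:

```python
import itertools

features = ['McCC', 'LOC', 'TCLOC', 'TLLOC', 'NL']
pairs = list(itertools.combinations(features, 2))
# C(5, 2) = 10 unordered pairs, in lexicographic order of feature position,
# starting with ('McCC', 'LOC')
```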
def show_performances_multivariate_models_for_other_couples():
features_list = ['McCC', 'LOC', 'TCLOC', 'TLLOC', 'NL']
model_performances_info = []
    for i, couple in enumerate(itertools.combinations(features_list, 2)):
        model_performances = [f'{couple[0]}+{couple[1]}']
        model_performances += list(map(lambda x: round(x, 3),
                                       multivariate_models_other_couples_scores[i]))
        model_performances_info.append(model_performances)
return pd.DataFrame(model_performances_info,
columns=['Features', 'R2 Score', 'P-value Feature 1',
'P-value Feature 2', 'MSE', 'RMSE'])\
.set_index('Features', drop=True)
other_couples_multivariate_analysis_results = \
show_performances_multivariate_models_for_other_couples()
other_couples_multivariate_analysis_results
| R2 Score | P-value Feature 1 | P-value Feature 2 | MSE | RMSE | |
|---|---|---|---|---|---|
| Features | |||||
| McCC+LOC | 0.120 | 0.556 | 0.284 | 141.290 | 10.567 |
| McCC+TCLOC | 0.098 | 0.114 | 0.562 | 143.236 | 10.693 |
| McCC+TLLOC | 0.139 | 0.198 | 0.168 | 158.011 | 11.207 |
| McCC+NL | 0.121 | 0.096 | 0.268 | 146.635 | 10.641 |
| LOC+TCLOC | 0.109 | 0.079 | 0.809 | 137.231 | 10.454 |
| LOC+TLLOC | 0.148 | 0.149 | 0.223 | 160.733 | 11.269 |
| LOC+NL | 0.126 | 0.084 | 0.415 | 140.260 | 10.456 |
| TCLOC+TLLOC | 0.120 | 0.254 | 0.067 | 151.679 | 11.148 |
| TCLOC+NL | 0.074 | 0.341 | 0.200 | 145.256 | 10.820 |
| TLLOC+NL | 0.095 | 0.287 | 0.531 | 163.982 | 11.521 |
Considering all the models built with every possible pair of predictors, only the one using Cognitive Complexity and LOC shows even a modest fit to the data. However, the p-values of both predictors are slightly greater than 0.05 (0.083 and 0.093, respectively). None of the other models shows any statistical significance.
df_multivariate_analysis_results = \
pd.concat([cognitive_complexity_multivariate_analysis_results,
other_couples_multivariate_analysis_results],
axis=0)
df_multivariate_analysis_results.sort_values(by=['R2 Score',
'P-value Feature 1',
'P-value Feature 2'],
ascending=False)
| R2 Score | P-value Feature 1 | P-value Feature 2 | MSE | RMSE | |
|---|---|---|---|---|---|
| Features | |||||
| Cognitive_Complexity+LOC | 0.175 | 0.083 | 0.093 | 136.096 | 10.220 |
| Cognitive_Complexity+McCC | 0.160 | 0.081 | 0.141 | 142.810 | 10.485 |
| LOC+TLLOC | 0.148 | 0.149 | 0.223 | 160.733 | 11.269 |
| McCC+TLLOC | 0.139 | 0.198 | 0.168 | 158.011 | 11.207 |
| Cognitive_Complexity+TCLOC | 0.133 | 0.039 | 0.303 | 140.938 | 10.610 |
| Cognitive_Complexity+TLLOC | 0.127 | 0.197 | 0.728 | 164.294 | 11.484 |
| LOC+NL | 0.126 | 0.084 | 0.415 | 140.260 | 10.456 |
| Cognitive_Complexity+NL | 0.122 | 0.096 | 0.476 | 145.068 | 10.716 |
| McCC+NL | 0.121 | 0.096 | 0.268 | 146.635 | 10.641 |
| McCC+LOC | 0.120 | 0.556 | 0.284 | 141.290 | 10.567 |
| TCLOC+TLLOC | 0.120 | 0.254 | 0.067 | 151.679 | 11.148 |
| LOC+TCLOC | 0.109 | 0.079 | 0.809 | 137.231 | 10.454 |
| McCC+TCLOC | 0.098 | 0.114 | 0.562 | 143.236 | 10.693 |
| TLLOC+NL | 0.095 | 0.287 | 0.531 | 163.982 | 11.521 |
| TCLOC+NL | 0.074 | 0.341 | 0.200 | 145.256 | 10.820 |